1. INTRODUCTION

Spatial knowledge discovery aims to find useful and novel patterns in spatial datasets to support 
decision-making in a particular problem domain [1]. Among all the possible patterns to discover, spatial asso- 
ciations are one of the most commonly used today in multiple fields such as climatology, geography, geology, 
criminology and ecology, among many others. They are comprised of predicates that involve spatial objects 
along with spatial and non-spatial relationships between those objects [2]. Several characteristics of spatial
data make this data mining task more complicated, such as the spatial dependency of data attributes, the
multiplicity of spatial data representation models, the spatial relations between data objects, and particular
spatial properties such as spatial autocorrelation and spatial heterogeneity [3].

Multiple algorithms have been developed for association pattern mining. Each of
these algorithms, in general, aims to address particular concerns about the aforementioned challenges.
Selecting a proper algorithm has become an arduous activity due to the growing number of new alternatives
and their variants, especially for inexperienced users. Thus, it is necessary to provide a new process for small or
medium-size application domains, one that is easy to implement and flexible enough to be adapted to multiple
contexts. Consequently, this paper proposes a new process for association mining discovery from spatial data


Journal homepage: http://journal.uad.ac.id/index.php/TELKOMNIKA


TELKOMNIKA Telecommun Comput El Control J 1885 


that utilizes graph theory to model spatial objects and the relations between them, and frequent subgraph mining
to find the substructures with a high repetition rate inside the general graph. These substructures correspond to
association patterns. The proposal is a new alternative to model complex situations from a particular problem
domain. It is not intended to replace or improve on the results of state-of-the-art algorithms, but to provide a
road map to initially address a problem. The rest of this paper is arranged as follows: section 2 describes the
characteristics of spatial data; section 3 presents association patterns and their characteristics regarding spatial
data; section 4 describes the proposed process for the discovery of spatial associations; section 5 shows a proof
of concept using real-world data; lastly, section 6 contains conclusions and future work.


2. SPATIAL DATA 

Spatial data is a particular type of dependent data. Formally, a spatial database $D$ is a set of spatial
records $D = \{T_1, T_2, \dots, T_n\}$ with $T_i = \{S^1, S^2, \dots, S^m, X^1, X^2, \dots, X^k\}$, where each $S^j$ is a spatial
attribute that stores values about the spatial context, and each $X^j$ is a non-spatial attribute with values
measured at particular locations [3, 4]. The non-spatial attributes may be numerical or categorical according to
the problem domain and the spatial attributes may be specified as coordinates or places (e.g. city name or 
state code). Additionally, there are three basic types of spatial objects: points, used to model specific punctual 
locations in the space; lines, used to model linear extensions such as rivers or roads; and polygons, used to 
represent objects that have a two-dimensional extension in the space, such as regions or states. 

The dependence of non-spatial attributes on spatial ones means that different implicit spatial relations
can be extracted from data. Let $D$ be a spatial database. A relation $R \subseteq D^2$ is called spatial if and only if it is
defined through a binary predicate $P(x, y)$, with $x, y \in D$, that involves the spatial attributes of the spatial
objects $x$ and $y$. For example, the spatial relation $N \subseteq D^2$, with $x, y \in D$, defined by the following predicate, is
the neighborhood relation between two spatial points using Euclidean distance:
$x\,N\,y \iff Dist(x, y) < \lambda$, $\lambda \in \mathbb{R}^+$.
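
The neighborhood relation above can be illustrated with a minimal Python sketch. It assumes 2-D points in a planar coordinate system, and the names `neighborhood_relation` and `threshold` are illustrative only. Pairs of a point with itself are left out here, which is a simplification chosen for the example rather than part of the definition.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two 2-D points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def neighborhood_relation(points, threshold):
    """Return the ordered index pairs (i, j), i != j, closer than `threshold`.

    `threshold` plays the role of the positive real bound in the predicate.
    """
    if threshold <= 0:
        raise ValueError("threshold must be a positive real number")
    return {
        (i, j)
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and euclidean_distance(p, q) < threshold
    }


# Hypothetical data: two nearby points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
relation = neighborhood_relation(points, threshold=2.0)
```

The relation is symmetric because the distance is symmetric, so each neighboring pair appears in both orders.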

These relations can be classified as geometric, if they are related to the principles of euclidean geom- 
etry (e.g. neighbouring relationships); directional, when they refer to relative spatial orientations (e.g. above, 
below, north, east); topological, if they are independent from the concepts of distance and direction and are 
not affected by spatial transformations such as rotation or translation (e.g. intersect, inside), or hybrid, if they 
are related to two or more of the aforementioned types of properties. These relationships can be calculated 
using different methods depending on the problem domain and the class of spatial data used: points, lines or 
polygons [5, 6]. 

On the other hand, two properties are derived from spatial dependence: spatial autocorrelation, i.e.,
observations of spatially distributed random variables are not location-independent, and spatial heterogeneity,
i.e., patterns found in some region of the space may not have the same support in other regions. Spatial
autocorrelation refers to the fact that spatial data are not distributed independently throughout the space.
The distribution depends on the characteristics of the data points, the characteristics of the underlying space, or
the spatial neighboring relationships. For example, churches tend to be located near public squares, and animals
tend to travel to locations that contain their food sources [7]. Spatial heterogeneity is related to spatial
autocorrelation. This phenomenon describes the local nature of spatial patterns, which are tied to
specific locations. Thus, a spatial pattern, such as an association rule, may have a high support value in one region
and a low support value in a different one. This phenomenon is also known as Simpson's paradox [8]. All these
particular characteristics make knowledge extraction from spatial data a complex activity that has to consider
not only patterns between data records, but also the implicit relationships between spatial objects.
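
Spatial heterogeneity can be made concrete with a small Python sketch that computes the support of a pattern globally and per region. The record layout (a region name plus a set of items) and the sample values are assumptions made only for this illustration.

```python
from collections import defaultdict


def support_by_region(records, pattern):
    """Return (global_support, {region: support}) for an itemset-like pattern.

    records: list of (region, set_of_items) pairs.
    pattern: set of items that must all be present for a record to count.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for region, items in records:
        totals[region] += 1
        if pattern <= items:
            hits[region] += 1
    if not totals:
        return 0.0, {}
    per_region = {region: hits[region] / totals[region] for region in totals}
    global_support = sum(hits.values()) / sum(totals.values())
    return global_support, per_region


# Hypothetical records: the pattern is common in "north" and rare in "south".
records = [
    ("north", {"church", "square"}),
    ("north", {"church", "square", "school"}),
    ("south", {"church"}),
    ("south", {"school"}),
    ("south", {"church", "square"}),
    ("south", {"clinic"}),
]
global_support, per_region = support_by_region(records, {"church", "square"})
```

In this toy dataset the global support hides the fact that the pattern holds for every record in one region and for only a quarter of the records in the other.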


3. SPATIAL ASSOCIATIONS 

One of the most common patterns to find in data is the association pattern. An association pattern
$P$ is defined as an n-ary predicate $P = (p_1, p_2, \dots, p_n)$ with a high probability of occurrence in the dataset.
Its classic application is supermarket basket analysis, which seeks to discover whether there is some correlation
between items that are bought together. An association pattern is referred to as spatial if at least one of its
atomic predicates $p_i$ involves a spatial relationship between its variables [2]. For example, in a city $C$, churches
and public squares tend to be neighbors:
$City(C) \wedge Church(X) \wedge PublicSquare(Y) \wedge Inside(X, C) \wedge Inside(Y, C) \wedge Neighbors(X, Y)$
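
As an illustration, the conjunction in this example can be checked against a small set of objects. The following Python sketch makes two simplifying assumptions: the topological predicate Inside is replaced by a stored `city` attribute, and Neighbors is evaluated with a planar distance threshold. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialObject:
    kind: str   # non-spatial attribute, e.g. "Church"
    city: str   # place-based spatial attribute, stands in for Inside(X, C)
    x: float    # planar coordinates used by the Neighbors predicate
    y: float


def neighbors(a, b, threshold):
    """Geometric predicate: the two objects are closer than `threshold`."""
    return math.hypot(a.x - b.x, a.y - b.y) < threshold


def pattern_instances(objects, city, threshold):
    """All (church, square) pairs inside `city` that satisfy the conjunction."""
    churches = [o for o in objects if o.kind == "Church" and o.city == city]
    squares = [o for o in objects if o.kind == "PublicSquare" and o.city == city]
    return [(c, s) for c in churches for s in squares if neighbors(c, s, threshold)]


# Hypothetical objects: only the first church has a square nearby in city "C".
objects = [
    SpatialObject("Church", "C", 0.0, 0.0),
    SpatialObject("PublicSquare", "C", 0.5, 0.0),
    SpatialObject("Church", "C", 10.0, 10.0),
    SpatialObject("PublicSquare", "Other", 10.2, 10.0),
]
instances = pattern_instances(objects, city="C", threshold=1.0)
```

The second church is close to a public square, but that square lies in another city, so the pair does not satisfy the Inside predicates.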

As shown in the previous example, $Inside(X, C)$, $Inside(Y, C)$ and $Neighbors(X, Y)$ are spatial
predicates related to topological and geometric relationships. Many different relations must be taken into
consideration at the same time to find useful spatial associations. Also, these relations must be calculated in
local contexts, due to the aforementioned Simpson's paradox.

Multiple efforts have been made to find spatial association patterns in spatial databases: [7]
proposes a method for spatial association mining that considers spatial autocorrelation by using a cell structure;
[9] focuses on the problem of rule extraction from spatial data with crisp condition attributes and fuzzy
decisions, using a rough-fuzzy set based rule extraction model to deal with both fuzziness and roughness; [10]
combines and extends techniques developed in both spatial and fuzzy data mining to deal with the uncertainty
found in typical spatial data, using fuzzy logic to obtain relevant information from transition areas
between spatial neighborhoods for spatial association mining and for modelling spatial relationships; [11, 12]
propose an algorithm for local pattern discovery considering spatial heterogeneity that incorporates a novel
spatial metric for support evaluation based on event density in a particular area; [13] presents a specially
designed algorithm to discover spatial associations related to El Niño Southern Oscillation (ENSO); [14] applies
an algorithm that explores multiple hierarchies of spatial objects; [15] uses A-Priori-based approaches to find
spatial association rules; [6, 16] propose using Inductive Logic Programming (ILP) for this data mining
purpose by modelling and extracting high-support spatial relations from spatial data; [17] worked with
metaheuristics such as genetic algorithms and evolutionary programming; [18] suggested a data-transformation
approach before using traditional association rule mining algorithms; [19] introduced non-trivial structures
such as graphs for spatial relationship representation; among others.

Because of this variety of spatial data mining approaches for association discovery, it is difficult to
select a proper algorithm or method for small knowledge discovery application contexts. Therefore,
a unified and general process is required to deal with the aforementioned problems, one that is easy to
implement and flexible enough to be adapted to multiple particular situations.


4. SPATIAL ASSOCIATION DISCOVERY PROCESS 

This work describes a new process for spatial association extraction that considers the possibility of
having multiple relationships between spatial objects of any kind (i.e. points, lines, polygons), as well as
spatial autocorrelation and spatial heterogeneity. The process is designed as a first approach to obtaining spatial
association knowledge from data in particular contexts, and to be easy to implement in small or medium-size
projects. The process (Figure 1) is divided into 5 main steps: data preparation (section 4.1), neighborhood
definition (section 4.2), modelling of spatial relationships using graphs (section 4.3), frequent subgraph mining
(section 4.4) and evaluation of results (section 4.5).


[Figure 1: flow diagram. Data Sources -> Data Preparation -> Integrated Data -> Neighborhood Definition ->
Spatial Neighborhoods -> Modelling Using Graphs -> Graph Model -> Frequent Subgraph Mining ->
Frequent Subgraphs -> Evaluation of Results -> Spatial Association Patterns]





Figure 1. Spatial association discovery process 


4.1. Data preparation 

The proposed process starts with a spatial data preparation step. It is necessary to codify the various
spatial datasets obtained from different sources in different formats, in order to enable the extraction of relations
between all the data instances in later steps. In general terms, it is not uncommon to have multiple layers of
spatial objects, each of them with a particular representation type and related to a particular scenario from the
problem domain. On the other hand, two types of datasets must be considered: target datasets, with objects directly
related to the problem domain that are going to be present in every association pattern, and relevant datasets,
which may or may not be related to the target datasets, but add important information that may be useful for
decision making [20].

These data must be prepared by cleaning errors, solving inconsistent and null values, and dealing with 
outliers. New attributes or even new data objects could be generated using the input data. This step requires 
considerable effort and may require many iterations. Thus, it is advisable to implement the process using a 
proper methodology such as CRISP-DM [21]. 
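
A minimal Python sketch of this preparation step is shown below. It assumes, for illustration only, that each layer is a list of latitude and longitude pairs keyed by object type, and it covers just three cleaning rules: null coordinates, impossible coordinates and exact duplicates. A real project would need rules specific to its sources.

```python
def integrate_layers(layers):
    """Merge per-type layers into one list of (kind, latitude, longitude) records.

    layers: dict mapping an object type to a list of (latitude, longitude)
    pairs, where a coordinate may be None when it is missing in the source.
    """
    seen = set()
    integrated = []
    for kind, points in layers.items():
        for lat, lon in points:
            if lat is None or lon is None:
                continue  # drop records with null coordinates
            if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                continue  # drop records with impossible coordinates
            record = (kind, lat, lon)
            if record in seen:
                continue  # drop exact duplicates
            seen.add(record)
            integrated.append(record)
    return integrated


# Hypothetical layers containing one duplicate, one null and one invalid value.
layers = {
    "Library": [(-34.60, -58.38), (-34.60, -58.38), (None, -58.40)],
    "Clinic": [(-34.61, -58.39), (-134.0, -58.39)],
}
prepared = integrate_layers(layers)
```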


4.2. Neighborhood definition 

As mentioned before, a particular spatial association pattern may have a higher occurrence probability 
in some regions and lower probability in others [8]. For this reason it is preferred to search for this kind of 
pattern locally. For this, we propose defining partitions of the dataset, called neighborhoods in this context, and 
the subsequent execution of the association pattern search algorithm on each of them. 

These neighborhoods can be defined beforehand using knowledge from the problem domain, or
using spatial clustering techniques. Using density-based or distance-based spatial clustering algorithms [22-24]
is suggested due to the First Law of Geography, which states that spatial objects located together are more
closely related than those that are far away from each other [25, 26]. Nonetheless, there is an issue to consider
in this step: the limits between neighborhoods may add important information for spatial association mining.
Thus, the use of fuzzy clustering techniques or flexible boundary models may be desirable.
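
As a rough illustration of a distance-based partition, the following Python sketch groups points into the connected components of a "closer than `eps`" relation. This is a deliberately simple stand-in and not one of the clustering algorithms cited above: it has no notion of density or noise, and the parameter name `eps` is an assumption of the example.

```python
import math
from collections import deque


def distance_based_neighborhoods(points, eps):
    """Label each point with the index of its connected component.

    Two points belong to the same neighborhood when a chain of points links
    them in which consecutive points are closer than `eps`.
    """
    labels = [None] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:  # breadth-first expansion of the current neighborhood
            i = queue.popleft()
            for j, q in enumerate(points):
                if labels[j] is None and math.dist(points[i], q) < eps:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels


# Hypothetical planar points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 1), (10, 10), (10, 11)]
labels = distance_based_neighborhoods(points, eps=1.5)
```

The association search would then be run once per label value, in line with the local search proposed in this step.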


4.3. Modelling of spatial relationships using graphs 

The next step is to calculate the spatial relations between the target data instances and the instances
of the relevant dataset from each neighborhood. Depending on the problem domain, different types of spatial
relations can be calculated: geometric, topological, directional or hybrid relationships, as mentioned above [6].
This step may have a high computational cost.

Graph theory is proposed to model the spatial relationships due to its close relation with first order 
logic and the pattern to find [16]. Graphs are discrete structures consisting of vertices and edges that connect 
these vertices. There are different kinds of graphs, depending on whether edges have directions (digraphs), 
whether multiple edges can connect the same pair of vertices (multigraphs), and whether loops are allowed. 

Formally, a simple graph $G = (V, E)$ consists of $V$, a nonempty set of vertices (or nodes), and $E$, a set
of edges. Each edge has two vertices associated with it, called its endpoints. An edge is said to connect its
endpoints. To relate each edge to its endpoints, a function $\phi : E \to \{v_1 \in V, v_2 \in V\}$, called the incidence
function, is used. A multigraph, on the other hand, is a graph where multiple edges can be associated with
the same endpoints. Additionally, each vertex and each edge can be labeled with data related to the represented
object. This structure can be adapted to multiple scenarios, and multiple efficient algorithms can be used to
extract valuable information such as maximum cliques [27].

In the context of this work, multigraphs are used to model spatial objects as vertices and the relations
between them as edges. A small example can be seen in Figure 2 (a). Two sets of labels and two extra
functions to assign those labels to the vertices and edges are needed. So, let $G$ be a multigraph without loops
$G = (V, E, L, K, \phi, \iota, \kappa)$ where: $V$ is the vertex set of $G$, which corresponds to the spatial objects from the
datasets; $E$ is the edge set, which corresponds to each calculated relationship between the spatial objects; $L$ is the
vertex label set, with the characteristics of the spatial data objects; $K$ is the edge label set, with the characteristics
of each spatial relation; $\phi : E \to \{x \in \mathcal{P}(V) : |x| = 2\}$ is the incidence function; and $\iota \subseteq V \times L$ and
$\kappa \subseteq E \times K$ are labeling relations.
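
The structure described above can be sketched in Python as follows. This is one possible encoding, chosen for illustration: each vertex carries a single label, each edge is stored as an unordered pair of distinct endpoints together with its label, and parallel edges are allowed. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledMultigraph:
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> vertex label
    edges: list = field(default_factory=list)          # (frozenset({u, v}), edge label)

    def add_vertex(self, vertex, label):
        self.vertex_labels[vertex] = label

    def add_edge(self, u, v, label):
        """Add a labeled edge; parallel edges are allowed, loops are not."""
        if u == v:
            raise ValueError("loops are not allowed")
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be existing vertices")
        self.edges.append((frozenset((u, v)), label))

    def edges_between(self, u, v):
        """Labels of every edge whose endpoints are exactly u and v."""
        ends = frozenset((u, v))
        return [label for edge_ends, label in self.edges if edge_ends == ends]


# Two spatial objects related by two different (hypothetical) relations.
g = LabeledMultigraph()
g.add_vertex(1, "Church")
g.add_vertex(2, "PublicSquare")
g.add_edge(1, 2, "close to")
g.add_edge(1, 2, "connected by road")
```

Storing the endpoints as a two-element set keeps the edges undirected and makes a loop impossible to represent, which matches the restriction on the incidence function.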

The aforementioned structure makes it possible to model multiple different relationships with the
same endpoints, labeled with different attributes. Also, many attributes of spatial objects could be taken into
consideration. Additionally, it must be noted that loops (i.e. edges with only one endpoint) are not considered
because they lack semantics in this context (no spatial relationship involves only one spatial
object). Fuzzy logic could also be a valuable tool to model the spatial relationships, if the situation requires it
[10]. More information about fuzzy logic can be found in [28].


4.4. Frequent subgraph mining 

To extract spatial associations with a high probability of occurrence, frequent subgraph mining is
proposed for each modeled graph. Given a multigraph $G = (V, E, L, K, \phi, \iota, \kappa)$ like the one described in
the previous section, the frequent subgraph mining problem in a single multigraph consists of finding a recurring subgraph

Spatial association discovery process using frequent subgraph mining (Giovanni Daián Rottoli)


1888 ISSN: 1693-6930


$G_i \subseteq G$, that is, a subgraph that has multiple instances in the original graph (Figure 2 (b)). Two
instances are counted as occurrences of the same subgraph when they are isomorphic, i.e. when there is a
one-to-one correspondence between their vertices and edges that preserves the labels. These
frequent subgraphs represent the relationships between spatial object types that take place in the space with a
high occurrence probability.

Multiple algorithms have been designed for frequent subgraph mining in a single large graph, each
calculating the relevance of a pattern in a different way. Some well-known examples are IncGM+, FSSG and
SUBDUE, among others [29, 30]. This step yields a set of frequent subgraphs for each neighborhood, which
must be analyzed to obtain useful knowledge for decision-making.
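
To give an intuition of what this step produces, the following Python sketch counts only the smallest possible subgraphs: single labeled edges together with the labels of their two endpoints. It is far simpler than the algorithms named above and does not implement any of them. The frequency threshold `min_count` and the function name are assumptions of the example.

```python
from collections import Counter


def frequent_edge_patterns(vertex_labels, edges, min_count):
    """Count one-edge patterns (label, relation, label) and keep the frequent ones.

    vertex_labels: dict mapping a vertex id to its label.
    edges: list of (u, v, edge_label) triples describing undirected edges.
    """
    counts = Counter()
    for u, v, edge_label in edges:
        # Sort the endpoint labels so an undirected edge maps to one pattern.
        a, b = sorted((vertex_labels[u], vertex_labels[v]))
        counts[(a, edge_label, b)] += 1
    return {pattern: n for pattern, n in counts.items() if n >= min_count}


# Hypothetical graph with two post office / nightclub edges and one other edge.
vertex_labels = {1: "Post office", 2: "Nightclub", 3: "Post office", 4: "Nightclub", 5: "School"}
edges = [(1, 2, "close to"), (3, 4, "close to"), (3, 5, "close to")]
patterns = frequent_edge_patterns(vertex_labels, edges, min_count=2)
```

Larger patterns require growing candidate subgraphs and testing isomorphism, which is where dedicated algorithms become necessary.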


4.5. Evaluation of results 

In the final step, the frequent subgraphs are translated into n-ary predicates, and those that represent
trivial information (non-novel patterns) must be filtered out. Support and confidence measures can be extracted,
selecting the metrics that the decision-maker considers most appropriate. This activity could be performed
automatically, or manually by an analyst with knowledge about the problem domain with help from an expert.
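
The translation and one possible metric can be sketched in Python as follows. The sketch assumes a one-edge pattern given as a (label, relation, label) triple, and it uses an illustrative definition of confidence: the fraction of vertices with the antecedent label that are linked by the relation to at least one vertex with the consequent label. Other definitions may suit a given domain better.

```python
def predicate_name(text):
    """Turn a label such as 'post office' into an identifier such as 'PostOffice'."""
    return text.title().replace(" ", "")


def to_predicate(pattern):
    """Render a (label, relation, label) pattern as a conjunction of predicates."""
    a, relation, b = pattern
    return (
        f"{predicate_name(a)}(x1) ∧ {predicate_name(b)}(x2) ∧ "
        f"{predicate_name(relation)}(x1, x2)"
    )


def rule_confidence(vertex_labels, edges, antecedent, relation, consequent):
    """Share of `antecedent` vertices linked by `relation` to a `consequent` vertex."""
    candidates = {v for v, label in vertex_labels.items() if label == antecedent}
    if not candidates:
        return 0.0
    satisfied = set()
    for u, v, edge_label in edges:
        if edge_label != relation:
            continue
        if u in candidates and vertex_labels[v] == consequent:
            satisfied.add(u)
        if v in candidates and vertex_labels[u] == consequent:
            satisfied.add(v)
    return len(satisfied) / len(candidates)


# Hypothetical graph: one of the two post offices is close to a nightclub.
vertex_labels = {1: "Post office", 2: "Nightclub", 3: "Post office", 4: "School"}
edges = [(1, 2, "close to"), (3, 4, "close to")]
text = to_predicate(("Post office", "close to", "Nightclub"))
confidence = rule_confidence(vertex_labels, edges, "Post office", "close to", "Nightclub")
```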







Figure 2. (a) Simplified example of spatial relationship modelling using a simple graph. (b) Example of 
frequent subgraph (bottom) found in a simple graph without labels in the edges. 


5. PROOF OF CONCEPT 

The proof of concept presented in this section is intended to show how the proposed process works
when implemented with different programming and data mining tools. The data used in this example consists of 10
data files containing the location of facilities in Buenos Aires (Argentina) and its surroundings. These facilities
include libraries (74), clinics (63), post offices (55), sports halls (50), nightclubs (41), schools (107), gas stations
(97), churches (125), museums (37) and police stations (93).

In the preparation step of the proposed process, the data files were integrated into
a single data file of spatial points using QGIS (http://qgis.org/). Each spatial point is comprised of two spatial
attributes, Latitude and Longitude, and one non-spatial attribute, the type of building from the previous list.
After that, the points located outside the Buenos Aires limits were filtered out to reduce the search space,
leaving 742 spatial points (Figure 3 (a), orange). Then, in the neighborhood definition step, the HDBSCAN
clustering algorithm [31] from the 'dbscan' library for the R programming language was used on the spatial
data attributes to generate two neighborhoods with a minimum number of points equal to 50 in each of them
(Figure 3 (a), blue). Only two neighborhoods were used for explanatory purposes.

In the next step, for each of the generated neighborhoods, a geometric relationship between its data
points was extracted, forming a graph with vertices labeled with the type of facility related to each data point
and edges labeled "close to" if the adjacent points were less than 150 meters away from
each other (this value was selected for illustrative purposes only). Thus, two graphs were created: one with 71
vertices and 45 edges in neighborhood 1, and another with 15 vertices and 11 edges in neighborhood 2.
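
A graph of this kind can be built with a short Python sketch. The source does not state how distances between latitude and longitude pairs were computed, so the haversine formula with a mean Earth radius is an assumption here, and the sample coordinates are invented for the example.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (an approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def build_close_to_graph(facilities, max_distance_m=150.0):
    """Return (vertex_labels, edges) for (kind, latitude, longitude) records."""
    vertex_labels = {i: kind for i, (kind, _, _) in enumerate(facilities)}
    edges = [
        (i, j, "close to")
        for (i, (_, lat1, lon1)), (j, (_, lat2, lon2)) in combinations(enumerate(facilities), 2)
        if haversine_m(lat1, lon1, lat2, lon2) < max_distance_m
    ]
    return vertex_labels, edges


# Invented coordinates: the first two points are roughly 50 meters apart.
facilities = [
    ("Post office", -34.6037, -58.3816),
    ("Nightclub", -34.6040, -58.3820),
    ("School", -34.6200, -58.4000),
]
vertex_labels, edges = build_close_to_graph(facilities)
```

Comparing every pair of points is quadratic in the number of points, which is acceptable for neighborhoods of this size but would call for a spatial index on larger datasets.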

To obtain the frequent subgraphs of each of the generated graphs, the SUBDUE algorithm was used
via its implementation in the Subdue Graph Miner software, using the compression rate as the support measure.
The result was a subgraph, as shown in Figure 3 (b), with a compression rate of 15.5% in neighborhood
1, which was translated into the predicate
$\text{Post office}(x_1) \wedge \text{Nightclub}(x_2) \wedge \text{Close to}(x_1, x_2)$,
and two subgraphs in neighborhood 2, both with a compression rate of 27.2%, which were translated into the predicates
$\text{Clinic}(x_1) \wedge \text{Post office}(x_2) \wedge \text{Close to}(x_1, x_2)$ and
$\text{Post office}(x_1) \wedge \text{Sport hall}(x_2) \wedge \text{Close to}(x_1, x_2)$.


[Figure 3, panel (b): vertices "Post Office" and "Sport Hall" joined by a "Close to" edge]


Figure 3. (a) Spatial neighborhoods generated for the proof of concept using HDBSCAN algorithm; (b) 
Results of the proof of concept. 


5.1. Discussion 

The first contribution of the proposed process is the possibility of adapting it to multiple scenarios,
due to its flexible underlying structure based on graphs. Some of the aforementioned methods use
flexible structures too [6, 16], but the complexity of these methods increases because of the use of techniques
based on Logic Programming. On the other hand, some other methods do not take into account complex
patterns [19]. Furthermore, the process makes it possible to include valuable information related to the data
objects and the spatial relations by using labels in the graph representation. Generally, the data structures
involved in other methods do not take into account complex data associated with the spatial relations between
spatial data.

In relation to the above, the proposed process considers spatial phenomena such as autocorrelation and
heterogeneity by using spatial neighborhoods. Some alternatives, such as [7], consider spatial autocorrelation
but not spatial heterogeneity or complex data relationships. In most of the cases studied, these
characteristics are present due to their relevance in data mining.

Also, the proposed process can be implemented using existing tools such
as frequent subgraph mining algorithms and clustering algorithms. Some of the state-of-the-art alternatives
include very flexible and powerful strategies, but they are hard to implement, which makes them unsuitable for
application in small or medium-size projects [6, 9, 16, 19]. Lastly, the high adaptability of the procedure is a desired
characteristic due to the possibility of selecting among many algorithms for the implementation of each step.
Usually, the state-of-the-art methods propose a single alternative for their execution.


6. CONCLUSION 

This work describes a knowledge discovery process for the extraction of spatial associations.
The process is flexible enough to take into account multiple and varied spatial relationships between spatial
objects of any kind, using a graph structure to model them. Heterogeneity and autocorrelation phenomena are
also considered, by defining neighborhoods where the search process is performed to find this class of regularity.
The solution was designed as an initial approach to this data mining task, without worrying too much about
particular characteristics of data mining algorithms. In a large-scale project, this process could guide the selection
of specific methods based on the results obtained in the first iterations of an incremental methodology. A proof
of concept is presented as well, using real data to illustrate how the process is implemented using different
programming and data mining tools in each of the proposed steps.

In future work, the research will focus on implementation strategies according to the problem
domain for each of the steps of the process, in order to decrease computational execution time when dealing
with large amounts of spatial objects and spatial relationships. Also, fuzzy methods will be considered for
relation modelling and neighborhood definition.